dirichlet noise added to prior probabilities during self play #186
base: master
Conversation
Thanks for sharing your code!
I made edits to my code to incorporate your remarks. Let me know if you believe it is behaving properly now.
I was able to get a Connect4 agent playing perfectly as far as I have explored. It wins 100% of games moving first in the evaluation loop, and it does not make any moves that would result in a loss or a draw when I play against it with the aid of https://connect4.gamesolver.org/. I used the updated Dirichlet algorithm I made with NMO13's input. Other changes to achieve this: My config was as follows: I don't know the exact number of iterations, because I had to restart the code due to memory leaks in my current implementation of self-play parallelization, but it was probably ~75.
I believe I spoke too soon. The second player has begun playing alternative moves, allowing the training to continue for new types of positions.
I added some comments to the code.
Sounds interesting, but I would suggest doing that in a different PR because otherwise it gets too messy. And please resolve the conflicts, thanks. ;-)
Thank you once again for your comments. I removed the bug (quite a large one) caused by calling self.dirichlet_noise instead of dirichlet_noise in the code I committed to my pull request. I also fixed this in the code for the model I am training now. I saw the results of your dirichlet_noise experiment and am glad the evidence matches our intuitions. How similar is your code for implementing Dirichlet noise? Also, let me know what you think about the i==0 and leaf-node situations.
Pretty similar. The main difference is that I don't override Ps, as already mentioned.
Notice the np.copy. And then I do:
That should ensure that the probabilities don't drift too far away.
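Something along these lines (a sketch of how that approach could look; the helper name noisy_priors, the eps value, and the exact signature are illustrative, not the actual code):

    import numpy as np

    # Mix fresh Dirichlet noise into a copy of the cached priors on every visit,
    # so the cached self.Ps[s] itself never drifts.
    def noisy_priors(self, s, valids, alpha=0.6, eps=0.25):
        probs = np.copy(self.Ps[s])                    # cached priors stay untouched
        valid_idx = np.nonzero(valids)[0]              # only valid moves receive noise
        noise = np.random.dirichlet([alpha] * len(valid_idx))
        probs[valid_idx] = (1 - eps) * probs[valid_idx] + eps * noise
        return probs / np.sum(probs)                   # keep a proper distribution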
I think I like my implementation of Dirichlet noise more, because I think it is more likely DeepMind implemented it like I did, based on my understanding of the paper. It seems better to me to search a path repeatedly in depth (if that path got a big increase in P from the Dirichlet noise). If you randomly assign Dirichlet noise for each simulation, I think the benefit will be more like adding uniform noise, because the variation of the noise averages out over repeated sampling. Do you agree or disagree?
Ok, so my intuition was to add, as you already mentioned, something more like uniform noise, just to make sure that the other paths are not "starving". I am just thinking that your solution of repeatedly overriding Ps could mess up the probabilities too much. But of course, I might be wrong and your solution works even better.
I am currently training your version. I am curious how it will perform against my version.
To minimize variables, you should use the Dirichlet alpha you used to train your other model. I got very far with 1.4, but it got stuck for ~15 iterations on all ties. I just dropped the alpha to 0.9, so hopefully this is enough variation for it to find the optimal solution. What alpha did you use when you trained?
I used 0.6. And for the first 100 iterations it finds a better model every ~4 iterations.
Comparing your version vs mine:
Comparing your version vs the original one:
Conclusion:
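Just for intuition on what these alpha values do (illustrative numbers, not from either training run): smaller alpha makes each Dirichlet sample spikier, larger alpha spreads it more evenly over the moves.

    import numpy as np

    np.random.seed(0)
    for alpha in (1.4, 0.9, 0.6):
        # One noise vector over 7 Connect4 columns per alpha value
        sample = np.random.dirichlet([alpha] * 7)
        print(alpha, np.round(sample, 2))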
Copy paste from suragnair/alpha-zero-general#186
if dirichlet_noise:
    self.applyDirNoise(s, valids)
    sum_Ps_s = np.sum(self.Ps[s])
    self.Ps[s] /= sum_Ps_s  # renormalize
applyDirNoise has already calculated a normalized sequence when it added two normalized sequences (scaled by 0.75 and 0.25, respectively), so there is no need to renormalize Ps[s] here – doing so has no effect.
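For reference, a sketch of what applyDirNoise presumably looks like under that description (the PR's actual body may differ; self.Ps[s] is assumed to already be normalized over the valid moves, with invalid entries equal to zero):

    import numpy as np

    def applyDirNoise(self, s, valids, alpha=0.6):
        valid_idx = np.nonzero(valids)[0]
        noise = np.random.dirichlet([alpha] * len(valid_idx))  # sums to 1
        # 0.75 * 1 + 0.25 * 1 = 1, so renormalizing self.Ps[s] afterwards
        # is a no-op (up to floating-point error).
        self.Ps[s][valid_idx] = 0.75 * self.Ps[s][valid_idx] + 0.25 * noise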
I don't completely understand the mathematics of the Dirichlet distribution, but I arbitrarily chose an alpha of 0.6 as the default value for dirichletAlpha in main.py because that seems approximately right for Othello. The one thing I am unsure about in my code is when to apply the noise to the prior probabilities the neural network generates for a board s. Currently I am applying it at every searched board node, because otherwise only the first move would get the Dirichlet noise, due to the caching of the prior probabilities and policies.
But the paper seems to imply it should only be applied at the root node:
Thoughts and feedback are appreciated.
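For what it's worth, here is one way root-only noise could be done despite the caching (a sketch only, not the code in this PR; the method and attribute names follow my reading of the repo's MCTS class and should be treated as assumptions):

    import numpy as np

    # Mix the noise into the cached root priors once per move in getActionProb,
    # and leave every deeper node's Ps untouched.
    def getActionProb(self, canonicalBoard, temp=1, dirichlet_noise=True, alpha=0.6):
        s = self.game.stringRepresentation(canonicalBoard)

        self.search(canonicalBoard)  # first simulation populates self.Ps[s]

        if dirichlet_noise:
            valids = self.game.getValidMoves(canonicalBoard, 1)
            valid_idx = np.nonzero(valids)[0]
            noise = np.random.dirichlet([alpha] * len(valid_idx))
            self.Ps[s][valid_idx] = 0.75 * self.Ps[s][valid_idx] + 0.25 * noise

        for _ in range(self.args.numMCTSSims - 1):
            self.search(canonicalBoard)

        # ...then build the visit-count policy from the stored counts as usual.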